56 research outputs found

    Approximate Membership for Regular Languages modulo the Edit Distance

    We present a probabilistic algorithm for testing approximate membership of words to regular languages modulo the edit distance. The time complexity of our algorithm, which is independent of the size of the input word, is polynomial in the size of the input automaton and the inverse error precision. All previous property testing algorithms for regular languages, whether they consider approximations modulo the Hamming distance or the edit distance with moves, run in exponential time unless one of these parameters is fixed.

    Learning n-ary Node Selecting Tree Transducers from Completely Annotated Examples

    We present the first algorithm for learning n-ary node selection queries in trees from completely annotated examples by methods of grammatical inference. We propose to represent n-ary queries by deterministic n-ary node selecting tree transducers (NSTTs), which are known to capture the class of MSO-definable n-ary queries. Despite this high expressiveness, we show that n-ary queries that select a polynomially bounded number of tuples per tree and are represented by deterministic NSTTs can be learned with polynomial time and data while allowing for efficient enumeration of query answers. An application to wrapper induction in Web information extraction yields encouraging results.

    Efficient Inclusion Checking for Deterministic Tree Automata and XML Schemas

    Special issue of LATA'08. We present algorithms for testing language inclusion L(A) ⊆ L(B) between tree automata in time O(|A| |B|) where B is deterministic (bottom-up or top-down). We extend our algorithms for testing inclusion of automata for unranked trees A in deterministic DTDs or deterministic EDTDs with restrained competition D in time O(|A| |Σ| |D|). Previous algorithms were less efficient or less general.

    gMark: Schema-Driven Generation of Graphs and Queries

    Massive graph data sets are pervasive in contemporary application domains. Hence, graph database systems are becoming increasingly important. In the experimental study of these systems, it is vital that the research community has shared solutions for the generation of database instances and query workloads having predictable and controllable properties. In this paper, we present the design and engineering principles of gMark, a domain- and query language-independent graph instance and query workload generator. A core contribution of gMark is its ability to target and control the diversity of properties of both the generated instances and the generated workloads coupled to these instances. Further novelties include support for regular path queries, a fundamental graph query paradigm, and schema-driven selectivity estimation of queries, a key feature in controlling workload chokepoints. We illustrate the flexibility and practical usability of gMark by showcasing the framework's capabilities in generating high-quality graphs and workloads, and its ability to encode user-defined schemas across a variety of application domains. Comment: Accepted in November 2016. URL: http://ieeexplore.ieee.org/document/7762945/. in IEEE Transactions on Knowledge and Data Engineering 201

    Schema-Guided Induction of Monadic Queries

    The induction of monadic node selecting queries from partially annotated XML-trees is a key task in Web information extraction. We show how to integrate schema guidance into an RPNI-based learning algorithm, in which monadic queries are represented by pruning node selecting tree transducers. We present experimental results on schema guidance by the DTD of HTML.

    Identification of biRFSA languages

    The task of identifying a language from a set of its words is not an easy one. For instance, it is not feasible to identify regular languages in the general case. Therefore, looking for subclasses of regular languages that can be identified in this framework is an interesting problem. One of the most classical identifiable classes is the class of reversible languages, introduced by D. Angluin, also called bideterministic languages as they can be represented by deterministic automata (DFA) whose reverse is also deterministic. Residual Finite State Automata (RFSA), on the other hand, are a class of nondeterministic automata that shares some properties with DFA. In particular, DFA are RFSA and RFSA can be much smaller. We study here the learnability of the class of languages that can be represented by biRFSA: RFSA whose reverses are also RFSA. We prove that this class is not identifiable in general, but we present two subclasses that are learnable, the second one being identifiable in polynomial time.

    Efficient Inclusion Checking for Deterministic Tree Automata and DTDs

    We present a new algorithm for testing language inclusion L(A) ⊆ L(B) between tree automata in time O(|A| |B|) where B is deterministic. We extend this algorithm for testing inclusion between automata for unranked trees A and deterministic DTDs D in time O(|A| |Σ| |D|). No previous algorithms with these complexities exist. A journal extension is available at http://hal.inria.fr/inria-00366082

    Query Induction with Schema-Guided Pruning Strategies

    Inference algorithms for tree automata that define node selecting queries in unranked trees rely on tree pruning strategies. These impose additional assumptions on node selection that are needed to compensate for small numbers of annotated examples. Pruning-based heuristics in query learning algorithms for Web information extraction often boost the learning quality and speed up the learning process. We distinguish the class of regular queries that are stable under a given schema-guided pruning strategy, and show that this class is learnable with polynomial time and data. Our learning algorithm is obtained by adding pruning heuristics to the traditional learning algorithm for tree automata from positive and negative examples. While justified by a formal learning model, our learning algorithm for stable queries also performs very well in practice for XML information extraction.

    Learning Top-Down Tree Transducers with Regular Domain Inspection

    We study the problem of how to learn tree transformations on a given regular tree domain from a finite sample of input-output examples. We assume that the target tree transformation can be defined by a deterministic top-down tree transducer with regular domain inspection (DTOPi:reg). An RPNI-style learning algorithm that solves this problem in polynomial time and with polynomially many examples was presented at Pods'2010, but restricted to the case of path-closed regular domains. In this paper, we show that this restriction can be removed. For this, we present a new normal form for DTOPi:reg by extending the Myhill-Nerode theorem for DTOP to regular domain inspections in a nontrivial manner. The RPNI-style learning algorithm can also be lifted, but it becomes more involved as well.

    Sublinear DTD Validity

    We present an efficient algorithm for testing approximate DTD validity modulo the strong tree edit distance. Our algorithm inspects XML documents in a probabilistic manner. It detects with high probability the nonvalidity of XML documents with a large fraction of errors, measured in terms of the strong tree edit distance from the DTD. The run time depends polynomially on the depth of the XML document tree but not on its size, so that it is sublinear in most cases. Therefore, our algorithm can be used to speed up exact DTD validators that run in linear time. We also prove a negative result showing that the run time of any approximate DTD validity tester must depend on the depth of the input tree. A long version is available here.